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HARMONIC  REGRESSION* 


F.  J.  Anscombe,  Yale  University 

[Invited  address,  October  26,  1977,  to  the  Third  ERDA  Statistical  Symposium, 
at  Battelle  Pacific  Northwest  Laboratories,  Richland,  WA] 

The  calculations  of  ordinary  regression  analysis  — linear  regres- 
sion by  the  method  of  least  squares  — have  been  correctly  doable  for  a 
century  and  a half.  There  have  been  changes  in  computational  methods 
used.  There  is  plenty  to  discuss  about  regression  — is  it  appropriate 
for  the  data,  what  do  the  results  mean?  No  doubt  the  calculations  are 
sometimes  of  little  value,  but  sometimes  they  are  appropriate  and  lead 
to  new  understanding. 

Regression  analysis  of  time  series  has  a much  shorter  history. 

There  is  a good  deal  of  literature  about  it,  but  the  literature  often 
has  the  air  of  arm-chair  meditation  by  a non-participant.  My  concern 
has  been  to  Implement  principles  that  are  in  the  literature,  and  devise 
a working  procedure.  Various  practical  difficulties  have  been  encountered 
that  do  not  seem  to  be  discussed  in  the  literature. 

Does  anyone  need  to  do  regression  analysis  of  time  series?  Con- 
flicting opinions  are  heard.  Huge  amounts  of  time-series  material  are 
being  collected  and  stored  relating  to  the  environment  (weather,  pollution), 
the  observations  being  made  dally  or  even  more  frequently.  Many  economic 
series  are  developed  for  monthly  or  weekly  or  daily  activities.  I have 
worked  with  annual  series,  which  are  probably  the  least  satisfactory 
material  for  this  kind  of  study. 


*Prepared  in  connection  with  research  supported  by  the  Army,  Navy,  Air 
Force  and  NASA  under  a contract  administered  by  the  Office  of  Naval  Research. 
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Sorat;  broad  generalities  are  presented  below,  and  an  example  is  shown. 
The  details  are  vital,  but  as  they  have  been  fully  described  elsewhere 
they  are  not  given  here.  (See  the  references;  the  detailed  presentation 
Is  referred  to  as  Chap.  10.) 

Formulation 

We  consider  regression  of  one  "dependent"  variable  on  just  one 
"independent"  or  predictor  variable.  (Methods  extend,  of  course,  but 
not  without  some  difficulties,  to  several  predictor  variables.)  All 
means  will  be  supposed  zero.  Then  ordinary  linear  regression  can  be 
formulated  thus.  We  are  given  observations  on  pairs  of  variables, 

(Xi,  y^)  for  i ■ 1,  2,  ...,  m . We  suppose  that  for  all  i 

^i  " ^ 

where  the  errors  considered  to  be  (in  some  sense)  Independent 

of  each  other  and  of  the  predictor  variable  {x^^}  . The  method  of  least 
squares  can  be  equated  to  the  method  of  maximum  likelihood  when  we  sup- 
pose that  the  {e^}  are  independent  random  variables  identically  dis- 
tributed N(0,  o^)  . 

How  should  regression  of  time  series  be  formulated?  We  are  given 
series  {x^}  , {y^}  , where  t ■ 1,  2,  ...,  n . We  shall  not  suppose 

these  series,  nor  the  error  series  when  we  introduce  it,  to  con- 

sist of  Independent  elements.  We  shall  Instead  suppose  the  series  to 
be  stationary,  that  is,  realizations  of  some  kind  of  stationary  stochastic 
process.  (In  practice  the  appearance  of  stationarlty  with  zero  mean  is 
encouraged  by  subtracting  a linear  or  other  trend,  usually  after  taking 
logarithms.)  To  correspond  to  (1),  one  might  suggest 


y,.  - 0 + Ct  . 


But  if  the  series  are  related,  the  relation  may  be  not  simultaneous  |)|SJP!3>jT 

'Di?i 

One  might  have 

^ ^ S • ft 

for  some  Integer  lag  j . But  then  one  might  as  well  postulate 

^ • '^1  "t-j  * 't  • 

where  j runs  over  some  suitable  set  of  Integer  values.  This  seems  to 
be  the  appropriate  formulation  for  stationary  processes,  to  correspond 
to  (1)  for  independent  processes.  The  first  member  of  the  right  side 
of  (2)  represents  a linear  filtering  of  {x^}  . 

There  are  two  main  approaches  to  trying  to  estimate  the  parameters 
{6j}  of  the  filter  in  (2). 

Time-domain  methods 

One  can  try  direct  multiple  regression  of  {y^}  on  {x^}  and  on 
lagged  versions  of  it,  {x^  for  various  j . There  is  a difficulty 
about  deciding  how  many  lags  should  be  considered.  If  {x^}  is  strongly 
autocorrelated,  conditioning  will  be  poor.  An  accurate  representation 
of  the  relation  between  two  stationary  stochastic  processes  could  easily 
involve  a large  number  of  nonzero  coefficients  • 

If  our  reason  for  trying  to  fit  a relation  like  (2)  is  to  be  able 
to  forecast  y^  from  past  values  of  ^x^}  , possibly  a very  crude  esti- 
mate of  the  will  be  good  enough.  The  precision  of  a forecast  is 

limited  by  the  variance  of  the  error  term.  The  greater  precision  that 
would  be  attained  if  the  were  known  exactly  may  be  only  negligibly 

greater  (Cleveland,  1967).  Box  and  Jenkins  (1970,  1976)  have  presented 
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a set  of  practical  procedures  for  estimating  the  structures  of  time  series 
well  enough  for  forecasting.  If  our  purpose  Is  not  forecasting,  but  under- 
standing as  well  as  we  can  the  relation  between  the  series,  the  Box- 
Jenklns  methods  may  be  less  satisfactory. 

It  will  be  argued  that  some  of  these  difficulties  are  mitigated  or 
avoided  by  frequency-domain  methods.  However,  we  must  usually  be  alert 
to  temporal  Instability  or  change  in  a relation  like  (2),  and  that  will 
be  detected  by  time-domain  methods. 


Frequency-domain  methods 

The  Idea  is  to  Fourier-transform  the  relation  (2),  and  estimate  the 
transform  of  will  be  suggested  that  (1)  this  procedure  is 

easier  to  carry  out  than  multiple  regression  in  the  time  domain,  and 
that  (il)  the  results  are  easier  to  understand.  Claim  (i)  derives  from 
the  fact  that  the  first  member  on  the  right  side  of  (2),  the  filtering 
of  {x^},  is  a convolution  of  {6j}  and  {x^},  and  transforms  to  the 
product  of  the  separate  transforms  of  {6^}  and  {x^}  . Thus  (2) 
becomes 

FT{y^}  - (FT{6j})-(FT{Xj.})  + FT{e^}  . 


These  Fourier  transforms  are  complex-valued  functions  of  a real 
variable  X representing  frequency.  Consider  a narrow  frequency  band 
(Interval  for  X).  Suppose  that  in  this  interval  the  transform  of  {6j} 
were  (near  enough)  constant.  Then  in  this  Interval  the  relation  between 
fT{y^}  and  FT{x^}  would  be  exactly  like  relation  (1)  above  between 
j {y^}  and  {x^},  with  the  exception  that  the  variables  and  the  regres- 

sion coefficient  are  complex-valued.  In  real  terms,  FT{6j}  is  con- 
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veniently  expressed  as  an  amplitude,  the  gain  function  G(A)  , and  an 
angle,  the  phase-shift  function  <ti(X)  . Thus  if  G(A)  and  i)>(X) 
could  be  regarded  as  constant  over  the  frequency  band,  they  could  be 
estimated  from  the  transforms  of  ^ slight  modifica- 

tion of  the  usual  procedure  for  the  linear  regression  relation  (1)  — 
expressed  in  real  terms  it  looks  a bit  different,  but  the  procedure  is 
really  ordinary  linear  least  squares  with  two  real  coefficients  to  be 
estimated.  The  least-squares  procedure  is  particularly  appropriate  if 
the  error  process  is  a stationary  Gaussian  process  whose  spectral 

density  is  nearly  constant  over  the  band. 

However,  it  has  been  generally  recognized  (Akalke  and  Yamanouchi, 
1962,  Cleveland  and  Parzen,  1975,  and  other  writers)  that  treating 
(>(X)  as  constant  is  not  satisfactory  when  its  derivative  is  much  dif- 


ferent from  0,  and  that  it  is  better  to  approximate  the  behavior  of  the 
transform  of  {6^}  in  the  narrow  frequency  band  by  three  real  parameters, 
the  average  values  of  G(X)  , <^(X)  and  <)>'(X)  in  the  band  — that  is. 


treat  G(X)  as  constant  and  41  (X)  as  linear  in  X . Now  the  regression 


procedure  is  further  modified,  becoming  in  fact  nonlinear  and  requiring 
an  iterative  solution,  but  still  computationally  rather  easy. 

Thus  the  complete  procedure  Involves  examining  the  frequency  range 
of  X in  bands,  using  a moving  "window",  and  in  each  band  doing  a small 
computation  to  determine  three  real  parameters,  representing  average 
values  of  G(X)  , 4(X)  , ()>'(X)  . Upon  putting  the  solutions  together 
we  see  the  whole  behavior  of  G(X)  and  <4(X)  . With  G(X)  and  ^(X) 
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Intelligibility 

Claim  (11)  Is  that  G(X)  and  (^(A)  are  what  we  need,  In  order  to 
understand  the  relation  between  {y^}  and  {x^}  , rather  than  the 

. If  the  latter  were  given,  we  should  have  to  Fourier-transform 
them  to  see  qualitatively  the  effect  of  the  filter.  Compare  with  the 
usual  commercial  description  of  performance  of  an  amplifier  In  a sound- 
reproduction  system. 


Example 

As  an  example  of  methods,  we  try  interrelating  an  annual  series  of 
total  copper-mine  output  for  the  U.S.  and  two  economic  annual  series, 
one  giving  the  New  York  price  of  copper,  the  other  the  total  dollar 
value  of  imports  of  merchandise  into  the  U.S.  The  copper  price  series 
is  thought  to  reflect  the  world  supply  and  demand  for  copper.  Changes 
in  price  might  be  expected  to  lead  to  similar  changes  in  production, 
possibly  a little  later.  The  imports  series  is  taken  as  an  indicator 
of  the  U.S.  economy.  The  copper  production  series  is  N233,  and  the 
price  series  is  N241,  in  Historical  Statistics  of  the  United  States  (1975); 
the  production  figures  run  from  1845  to  1970,  the  prices  from  1850  to 
1970.  The  figures  have  been  taken  exactly  as  published,  except  that  to 
smooch  a change  in  price  definition  in  1968  the  average  of  two  defini- 
tions has  been  used  for  1967.  The  price  figures  for  1850-1859  are  of 
uncertain  meaning,  and  the  production  figures  for  before  1860  show  a 

more  rapid  proportional  rate  of  growth  than  later.  For  present  purposes 

« 

it  has  seemed  wise  to  ignore  the  pre-1860  data.  Continuation  of  the 
series  from  1970  to  1975  has  been  obtained  from  the  Statistical  Abstract 
of  the  United  States  (editions  for  1975  and  1976).  The  two  series  are 
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reproduced  in  Figure  1,  except  that  the  last  two  digits  of  the  production 
entries  have  been  dropped  for  ease  of  reading.  The  imports  series  has 
been  given  in  Chap.  10  and  is  not  reproduced  here;  only  the  portion  from 
1860  to  1975  is  used. 

Figure  2 shows  a plot  against  the  date  of  the  logarithm  of  the  pro- 
duction series,  with  linear  regression  on  date  subtracted.  Figure  3 is 
a similar  plot  for  the  price  series.  A plot  for  the  imports  series  has 
been  given  in  Chap.  10. 

The  three  given  series  have  each  116  entries  (for  1860-1975).  To 
prepare  them  for  Fourier  analysis  they  have  been  prewhitened  by  the 
steps:  (i)  take  logarithms,  (ii)  subtract  the  linear  regression  on 

date,  (Hi)  filter  by  the  two-point  filter  with  weights  (-0.9,  1).  The 
last  operation  reduces  the  length  of  each  series  to  115.  Then  the  series 
have  been  circularized  (tapered)  by  linearly  splicing  the  first  7 and 
the  last  7 entries,  so  that  the  length  of  each  series  becomes  108.  The 
Fourier  transform  is  made  at  frequencies  (0,  1,  2,  ...,  54)/108  cycles 
per  year;  the  transform  is  expressed  as  a set  of  (real)  coefficients  of 
cosine  and  sine  terms,  or  alternatively  as  a set  of  squared  amplitudes 
and  phase  angles.  The  frequencies  are  referred  to  as  harmonics,  num- 
bered 0 through  54. 

The  first  step  to  perceiving  an  interrelation  between  any  pair  of 
series  is  to  plot  the  difference  of  phase  angles  at  each  harmonic  against 
the  harmonic  number.  Figure  4 shows  this  for  the  production  and  price 
series,  and  Figure  5 for  the  production  and  imports  series.  At  each 
harmonic,  the  product  of  the  amplitudes  is  classified  by  size  into  one 
of  six  categories  and  represented  by  one  of  the  plotting  symbols: 
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Each  phase  difference  is  plotted  with  this  symbol  twice  over  in  the 
interval  from  0 to  8 rightangles.  In  looking  for  trends,  the  viewer’s 
eye  should  be  guided  by  the  heavier  symbols. 

Figure  5 shows  a fairly  strong  relation  between  production  and 
Imports,  especially  at  the  higher  frequencies  — the  phase  differences 
are  mostly  rather  close  to  4 (or  0 or  8)  rightangles,  and  show  no  trend 
with  frequency.  A simultaneous  positive  correlation  between  these  two 
series  is  indicated.  Figure  4 shows  a less  clear  relation  between  pro- 
duction and  price.  At  lower  frequencies  there  is  some  suggestion  of 
trend  in  the  phase  differences,  implying  that  production  follows  price, 
possibly  by  two  years,  possibly  by  four.  At  higher  frequencies  the 
phase  differences  seem  very  scattered.  Not  reproduced  is  a phase  dif- 
ference plot  for  the  price  and  imports  series,  suggesting  quite  a strong 
simultaneous  correlation  at  the  lower  frequencies,  and  not  much  at  higher 
frequencies. 

Now  the  regression  calculation  in  frequency  bands,  to  estimate 
G(X)  , i}i(X)  and  ((>'(X)  , can  be  performed.  The  window  chosen  is  23 
harmonics  wide,  and  sine  weights  have  been  used.  The  results  are  tabu- 
lated in  Figure  6.  The  first  column  lists  the  harmonic  number  of  the 
central  frequency  in  the  band;  we  have  stepped  the  central  harmonic 
number  from  the  lowest  possible  value,  11,  by  unit  steps  to  the  greatest 
possible  value,  43.  (Had  there  been  many  more  harmonics  and  a greater 
bandwidth,  greater  steps  would  have  been  convenient.)  The  next  three 
columns  list  estimates  of  spectral  density  for,  respectively,  copper  pro- 
duction, copper  price  and  Imports  (prewhitened  as  explained  above), 
obtained  from  the  raw  line  spectra  by  the  23-point  sine-weighted  moving 
average.  The  next  four  columns  refer  to  regression  of  copper  production 
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on  copper  price.  They  list  average  values  in  the  band  of  G(A)  , <p(X) 

. 2 

and  (p  (X)  , and  (in  the  fourth  of  these  columns)  multiple  R (the 

2 

coherency).  The  behavior  of  4>(A)  and  of  R gives  a numerical  measure 

of  the  trend  seen  in  Figure  4.  The  last  four  columns  of  Figure  6 give 

similar  information  for  regression  of  copper  production  on  imports,  and 

relate  to  Figure  5.  (Simultaneous  regression  of  copper  production  on 

both  copper  price  and  imports  is  not  considered  at  this  point.) 

To  test  a null  hypothesis  of  no  association  between  series,  5%,  1% 

2 

and  0.1%  values  for  R for  any  given  frequency  band  are  estimated  (by 
a crude  argument)  at  0.19,  0.27  and  0.36,  respectively  — these  values 
probably  err  in  being  a little  too  low.  The  tabulated  values  are  very 
highly  correlated,  as  one  reads  down  the  column.  So  for  regression  of 
production  on  price,  it  seems  reasonable  to  claim  a substantial  correla- 
tion at  low  frequencies,  in  the  bands  centered  between  the  11th  and 
19th  harmonics.  For  regression  of  production  on  Imports,  the  correla- 
tion is  substantial  in  bands  centered  between  the  23rd  and  43rd  har- 
2 

monies  — R is  close  to  0.5  in  many  of  these  bands. 

Of  our  two  predictor  variables,  copper  price  and  general  Imports, 
the  latter  has  on  the  whole  the  greater  correlation  with  copper  produc- 
tion. But  the  two  predictor  series  have  some  correlation  with  each 
other.  How  useful  is  the  price  series  as  a predictor  in  conjunction 
with  the  imports  series?  Residual  Fourier  transforms  of  the  production 
series  and  of  the  price  series,  after  regression  on  the  Imports  series, 
can  be  obtained,  and  a phase-difference  plot  can  be  made,  analogous  to 
Figure  4 for  the  original  Fourier  transforms.  This  is  shown  in  Figure  7. 
The  phase  trend  seems  rather  similar  to  that  in  Figure  4 at  lower  fre- 
quencies, and  weaker  at  higher  frequencies. 
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Flgure  8 shows  a calculation  like  that  in  Figure  6,  but  relating 

to  simultaneous  regression  of  production  on  both  price  and  imports, 

2 

instead  of  to  separate  regressions.  The  R in  the  final  column  is 

2 

always  greater  than  either  value  of  R (for  the  same  frequency  band) 

2 

given  in  Figure  6.  The  most  striking  increase  over  the  R for  regres- 
sion on  imports  only  occurs  for  bands  centered  between  the  25th  and  28th 
harmonics  — 0.479  Instead  of  0.332  at  the  25th  harmonic,  0.511  instead 
of  0.359  at  the  26th  harmonic,  etc.  The  same  sort  of  crude  argument  as 
before  indicates  that  these  four  increases  (but  none  of  the  others)  can 
be  regarded  as  significant  at  the  5%  level.  The  increases,  on  the 
whole,  are  larger  at  lower  frequencies  than  at  higher  frequencies. 

The  two  phase-shift  functions  estimated  in  Figure  8 can  be  fairly 
well  approximated  at  most  frequencies  by  saying  that  production  is 
correlated  positively  with  imports  of  the  same  year  and  negatively  with 
prices  of  four  years  before. 

Figures  9 and  10  are  time-domain  plots  Intended  to  show  whether  the 
relations  between  the  series  perceived  in  the  harmonic  analysis  pervade 
the  whole  series  or  are  special  to  particular  epochs.  For  both  plots, 
the  original  series  have  been  transformed  to  logarithms  and  a linear 
trend  has  been  subtracted.  Then  for  Figure  9,  low  frequencies  have 
been  suppressed  by  taking  the  second  difference  of  the  series,  and  the 
resulting  production  values  are  plotted  against  the  imports  values.  The 
correlation  coefficient  is  0.60.  The  decade  of  each  plotted  point  is 
shown  by  the  letters  appearing  on  the  right  side  of  Figure  1;  a star 
means  that  two  or  more  points  have  coincided.  For  Figure  10,  the  spectra 
have  been  roughly  whitened  by  taking  the  first  difference  of  each 
series,  and  then  high  frequencies  have  been  suppressed  by  three  simple 


2-point  averagings.  The  first  4 values  of  the  resulting  production  series 
and  imports  series  have  been  dropped,  and  the  last  4 values  of  the  result- 
ing price  series;  and  then  the  production  values  are  plotted  against  the 
linear  combination  of  the  Imports  values  (for  the  same  year)  minus  0,8 
times  the  price  values  (for  four  years  earlier).  The  correlation  coef- 
ficient is  0.53.  The  decade  of  the  production  values  is  shown  as  before. 

The  pronounced  correlation  in  both  Figures  9 and  10  is  due  to  a few 
extreme  points  labeled  G or  H , representing  the  two  decades  from 
1920  to  1939,  If  all  points  for  these  decades  were  omitted,  the  correla- 
tion would  become  0.04  for  Figure  9 and  -0.08  for  Figure  10,  That  is, 
the  correlation  would  disappear. 

Discussion 

Harmonic  regression  is  a systematic  way  of  looking  for  association 
between  series  in  all  parts  of  the  frequency  range.  It  is  unlikely  to 
reveal  anything  that  cannot  be  found  by  careful  visual  comparison  of 
plots  such  as  those  in  Figures  2 and  3,  at  least  when  only  two  or  three 
series  are  under  consideration.  (A  similar  remark  can  be  made  about 
ordinary  regression.) 

We  have  found  clear  evidence  of  association  between  copper  produc- 
tion and  general  imports,  at  higher  frequencies,  and  some  suggestion  of 
predictive  value  for  copper  price  also,  at  middle-to-low  frequencies. 

What  associations  there  are  seem  to  inhere  in  the  economically  turbulent 
years  of  the  20*3  and  30's.  We  do  not  see  similar  associations  in  the 
other  decades.  Possibly  relations  between  these  series  are  changing, 
possibly  the  phenomena  are  highly  nonlinear. 
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Smoothed  spectral  1 Regression  coefficients  of  copper  prodn,  with  R*2, 
estimates  I (l)  on  copper  price  (ii)  on  imports 
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